D.15 R FOR STATISTICS
Approximate Cost: Free
Source: http://www.r-project.org
Operating System Needs: Operates on Windows, Mac OS, and most versions of UNIX.
Input Structure: Scripts can be written in R to read and analyze data from a wide variety of data sources including, but not limited to, text/binary files, spreadsheets, and databases.
Overview
According to the R FAQ (Hornik 2013), "R is a system for statistical computation and graphics consisting of a programming language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files." "R is an integrated suite of software facilities for data manipulation, calculation, and graphical display" according to the “An Introduction to R” document (Venables et al. 2013).
The statistical functions in R support linear and generalized linear models, nonlinear regression models, time series analysis, classical parametric and nonparametric tests, clustering and smoothing, analysis of spatial data, and Bayesian analysis, among others. In addition to storing and manipulating data, R offers a mature collection of functions that helps produce report-quality graphics. R can be downloaded free from the Comprehensive R Archive Network (CRAN; http://www.r-project.org/). It is distributed under a GNU-style copyleft (http://www.gnu.org/copyleft/copyleft.html) license and is part of the GNU project (http://www.gnu.org).
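As a rough illustration of the parametric and nonparametric tests mentioned above, the sketch below compares two samples with a two-sample t-test and a Wilcoxon rank-sum test, both from the base stats package. The data are simulated and the names upgradient and downgradient are invented for this example.

```r
# Simulated (hypothetical) concentrations for two groups of wells
set.seed(42)
upgradient   <- rlnorm(20, meanlog = 0.0, sdlog = 0.5)
downgradient <- rlnorm(20, meanlog = 0.6, sdlog = 0.5)

# Parametric: Welch two-sample t-test on log-transformed values
t_res <- t.test(log(downgradient), log(upgradient))

# Nonparametric: Wilcoxon rank-sum test on the raw values
w_res <- wilcox.test(downgradient, upgradient)

print(t_res$p.value)
print(w_res$p.value)
```

The t-test assumes the log-transformed values are approximately normal, while the rank-sum test makes no such distributional assumption.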
Functions and corresponding data sets are typically organized in units called ‘packages’. The directory where packages are stored is called the library. R comes with a standard set of packages in the standard library; a list of these packages, with detailed descriptions and documentation for each, can be found at http://stat.ethz.ch/R-manual/R-devel/library/base/html/00Index.html. Additional contributed packages can be downloaded and installed as needed from the CRAN website at http://CRAN.R-project.org/ and from related sites such as Bioconductor (http://www.bioconductor.org/) and Omegahat (http://www.omegahat.org/). Once installed, a package must be loaded into the session before it can be used. Advanced users can program their own packages for custom applications.
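The install-then-load workflow described above might look like the following minimal sketch. The package name "survival" is only an example choice, and the installation call is left commented out because it needs network access.

```r
# List the packages currently installed in the library
installed <- rownames(installed.packages())
print(head(installed))

# Load a standard package into the session
library(stats)

# Example contributed/recommended package; the name is an illustrative choice
pkg <- "survival"
if (!requireNamespace(pkg, quietly = TRUE)) {
  # install.packages(pkg)   # uncomment to download and install from CRAN
  message("Package '", pkg, "' is not installed")
}
```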
| Statistical Method | Capability As Is | Capability with Scripts/Add-Ins |
|---|---|---|
| Handling of NDs | | |
| | ● | N/A |
| | ◒ | ● |
| | ◒ | ● |
| | ◒ | ● |
| Exploratory/Diagnostic Tools | | |
| Summary Statistics | ● | N/A |
| | ● | N/A |
| | ● | N/A |
| Data transformations | ● | N/A |
| Statistical Design | | |
| Statistical Power | ● | N/A |
| | ● | N/A |
| Contaminant ranking | ● | N/A |
| | | ◒ |
| Statistical Limits | | |
| | ● | N/A |
| | ● | N/A |
| | ● | N/A |
| Testing Compliance Limits | ● | N/A |
| Graphics | | |
| Plots/Charts | ● | N/A |
| Batch plots | ● | N/A |
| Tweaking of graphics | ● | N/A |
| Statistical Comparisons | | |
| | ● | N/A |
| | ● | N/A |
| Spatial Analysis | | |
| Geostatistics/Mapping | ◒ | ● |
| | ◒ | ● |
| | ◒ | ● |
| Regression/Time Series | | |
| | ● | N/A |
| | ● | N/A |
| | ● | N/A |
| | ● | N/A |
| | ● | N/A |
| | ● | N/A |
| Multivariate Analysis | | |
| Multiple regression | ● | N/A |
| Factor/Discriminant analysis | ● | N/A |
| | ● | N/A |
Capability Ratings:
N/A = Not applicable or not available
● = Full capability
◒ = Some capability
(blank cell) = No capability
Add-Ins Available
Several existing add-on packages extend the functionality of R. A partial list can be found at http://cran.r-project.org/doc/FAQ/R-FAQ.html#Add_002don-packages-from-CRAN.
Ease of Use and Data Import
The most common data structures in R are vectors and data frames. Higher-order data structures such as lists are also available for advanced analysis. The R environment may challenge a new user; however, an interactive user interface and comprehensive help documentation are provided. In addition, graphical user interfaces that provide access to commonly used functions are under active development.
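The data structures named above, and a simple import, can be sketched as follows. The well names, values, and column names are made up for illustration, and the inline text stands in for a file that would normally be read from disk.

```r
# A numeric vector
conc <- c(1.2, 0.8, 2.5, 3.1)

# A data frame: one row per sample, columns of differing types
samples <- data.frame(
  well     = c("MW-1", "MW-1", "MW-2", "MW-2"),
  conc     = conc,
  detected = c(TRUE, FALSE, TRUE, TRUE)
)

# A list can hold objects of different kinds and lengths
site <- list(name = "Example site", data = samples, units = "mg/L")

# Reading delimited text; a file path would normally replace the text argument
csv_text <- "well,conc\nMW-1,1.2\nMW-2,2.5"
imported <- read.csv(text = csv_text)

str(imported)
print(mean(samples$conc))
```

Help for any function is available from the console, for example with `?read.csv`.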
Types of Distributions
R can be used to calculate properties of probability distributions and to check whether a given data set fits a standard distribution. A number of distributions and distributional tests are supported in R, including: beta, binomial, Cauchy, chi-squared, exponential, F, gamma, geometric, hypergeometric, lognormal, logistic, negative binomial, normal, Poisson, Student's t, uniform, and Weibull.
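A brief sketch of how distribution properties and distributional tests are typically accessed in base R follows. The data are simulated, and the gamma shape and rate values passed to the Kolmogorov-Smirnov test are assumed for illustration rather than estimated from the data.

```r
# Density, cumulative probability, quantile, and random draws follow a
# d/p/q/r naming pattern, shown here for the normal distribution
dnorm(0)        # density at 0
pnorm(1.96)     # P(X <= 1.96)
qnorm(0.975)    # 97.5th percentile

# Simulated lognormal data (hypothetical)
set.seed(1)
x <- rlnorm(50, meanlog = 1, sdlog = 0.4)

# Shapiro-Wilk normality test on raw and log-transformed values
sw_raw <- shapiro.test(x)
sw_log <- shapiro.test(log(x))
print(c(raw = sw_raw$p.value, log = sw_log$p.value))

# Kolmogorov-Smirnov test against a gamma distribution with assumed parameters
ks_res <- ks.test(x, "pgamma", shape = 6, rate = 2)
print(ks_res$p.value)
```

The same d/p/q/r prefixes apply to the other supported distributions, for example `dgamma`, `pweibull`, and `qlnorm`.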
Visualization
R has a mature graphics library and can produce presentation-quality graphics for most of the commonly used plots, such as stem-and-leaf plots, box plots, scatter plots, histograms, and contours.
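The plot types listed above are all available in base R graphics. The following sketch writes four of them to a single PDF page. The chloride and sulfate values are simulated, and the output file name is an arbitrary choice.

```r
# Simulated (hypothetical) data for two wells
set.seed(7)
df <- data.frame(
  well     = rep(c("MW-1", "MW-2"), each = 15),
  chloride = rlnorm(30, meanlog = 3, sdlog = 0.3),
  sulfate  = rlnorm(30, meanlog = 4, sdlog = 0.3)
)

# Text stem-and-leaf display printed to the console
stem(df$chloride)

# Write a 2 x 2 panel of plots to a PDF file in the temporary directory
out_file <- file.path(tempdir(), "example_plots.pdf")
pdf(out_file, width = 8, height = 8)
par(mfrow = c(2, 2))
boxplot(chloride ~ well, data = df, main = "Box plot")
hist(df$chloride, main = "Histogram", xlab = "chloride")
plot(df$chloride, df$sulfate, main = "Scatter plot",
     xlab = "chloride", ylab = "sulfate")
contour(volcano, main = "Contour plot")  # volcano is a built-in example matrix
dev.off()
```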
Primary Uses for Groundwater Data Analysis
R is commonly used to perform the following tasks:
- calculate summary statistics
- perform distributional tests
- get point estimates of population mean
- get interval estimates of population mean with known and unknown variance
- determine the sample size needed to estimate a population mean
- calculate point and interval estimates of population proportion
- test hypotheses
- perform linear and nonlinear regression
- perform analysis on time-series and spatial data
- analyze nondetects in data using substitution-type methods and also more advanced maximum likelihood estimator methods
- develop custom applications
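Several of the tasks listed above can be sketched with functions from base R alone. The example below uses simulated concentrations, and the reporting limit, effect size, and exceedance count are assumed values chosen for illustration. For nondetects it shows only simple substitution, not the maximum likelihood methods, which usually rely on add-on packages.

```r
set.seed(123)
x     <- rlnorm(24, meanlog = 1, sdlog = 0.5)  # hypothetical concentrations
month <- 1:24

# Summary statistics and a point estimate of the mean
print(summary(x))
print(mean(x))

# 95% interval estimate of the mean (variance unknown, t-based)
ci <- t.test(x, conf.level = 0.95)$conf.int

# Sample size needed to detect a shift of 1 unit with 80% power
pw <- power.t.test(delta = 1, sd = sd(x), power = 0.8,
                   sig.level = 0.05, type = "one.sample")

# Point and interval estimate of a proportion
# (assumed: 7 exceedances out of 24 samples)
pr <- prop.test(7, 24)

# Linear regression of concentration on time
fit <- lm(x ~ month)

# Simple substitution for nondetects: values below an assumed
# reporting limit are replaced by half the limit
rl       <- 1.5
detected <- x >= rl
x_sub    <- ifelse(detected, x, rl / 2)

print(ci)
print(ceiling(pw$n))
print(pr$conf.int)
print(coef(fit))
```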
Benefits
- provides a flexible, interactive, and powerful environment for data analysis and visualization
- free
- built-in support for a variety of statistical analyses, from simple to complex
- scripts for performing complex analysis
- easily produces presentation-quality graphics and automated reports
- active and knowledgeable online community for support issues
- detailed online documentation
Limitations and Data Requirements
- The program provides the functions and libraries to read and process data from a variety of sources including, but not limited to, ASCII files, binary files, spreadsheets, and databases.
- As long as the data format and structure are known, data can be imported into the R environment.
- The environment is challenging to the first-time user and presents a steep initial learning curve.
References
Faraway, J. 2002. Practical Regression and ANOVA Using R. http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf.
Hornik, K. 2013. The R FAQ. http://CRAN.R-project.org/doc/FAQ/R-FAQ.html.
R Development Core Team. 2008. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. http://www.r-project.org.
Venables, W.N., D.M. Smith, and the R Core Team. 2013. An Introduction to R. Notes on R: A Programming Environment for Data Analysis and Graphics. Version 3.0.1.
Publication Date: December 2013